Abstract

We perform a deeper analysis of an axiomatic approach to the concept of intrinsic dimension of a dataset proposed by us in the
IJCNN'07 paper. The main features of our approach are that a high intrinsic dimension of a dataset reflects the presence of the curse
of dimensionality (in a certain mathematically precise sense), and that the dimension of a discrete i.i.d. sample of a low-dimensional
manifold is, with high probability, close to that of the manifold. At the same time, the intrinsic dimension of a sample is easily
corrupted by moderate high-dimensional noise (of the same amplitude as the size of the manifold) and suffers from prohibitively
high computational complexity (computing it is an NP-complete problem). We outline a possible way to overcome these difficulties.

Key words: intrinsic dimension of datasets, concentration of measure, curse of dimensionality, space with metric and measure, features

1. Introduction

An often-held opinion on intrinsic dimensionality of data
sampled from submanifolds of the Euclidean space is
expressed in [10] thus: "...the goal of estimating the dimension
of a submanifold is a well-defined mathematical problem.
Indeed all the notions of dimensionality like e.g. topological,
Hausdorff, or correlation dimension agree for submanifolds
in $R^d$."

We will argue that it may be useful to have at one's disposal
a concept of intrinsic dimension of data which behaves
in a different fashion from the more traditional concepts.
Our approach is shaped by the following five goals.

1. We want a high value of intrinsic dimension to be 
indicative of the presence of the curse of dimensionality. 

2. The concept should make no distinction between con- 
tinuous and discrete objects, and the intrinsic dimension of 
a discrete sample should be close to that of the underlying 
manifold. 

3. The intrinsic dimension should agree with our geomet- 
ric intuition and return standard values for familiar objects 
such as Euclidean spheres or Hamming cubes. 

4. We want the concept to be insensitive to high-dimensional
random noise of moderate amplitude (on the
same order of magnitude as the size of the manifold).

5. Finally, in order to be useful, the intrinsic dimension 
should be computationally feasible. 

For the moment, we have managed to attain the goals
(1), (2), (3), while (4) and (5) are not met. However, it appears
that in both cases the problem is the same, and we
outline a promising way to address it.

Among the existing approaches to intrinsic dimension,
that of [5] comes closest to meeting the goals (2), (3), (5)
and to some extent (1), cf. a discussion in [?]. (Lemma 1
in [?] seems to imply that (4) does not hold for moderate
noise with $E\|x\| = O(1)$, i.e., $\sigma \sim 1/\sqrt{d}$.)

We work in a setting of metric spaces with measure (mm-spaces),
i.e., triples $(X, d, \mu)$ consisting of a set, $X$, furnished
with a distance, $d$, satisfying the axioms of a metric, and
a probability measure $\mu$. This concept is broad enough
to include submanifolds of $R^d$ (equipped with the induced,
or Minkowski, measure, or with some other probability
distribution), as well as data samples themselves (with their
empirical, that is normalized counting, measure). In Section 2,
we describe this setting and discuss in some detail
the phenomenon of concentration of measure on high-dimensional
structures, presenting it from a number of different
viewpoints, including an approach of soft margin classification.

The curse of dimensionality is understood as a geometric
property of mm-spaces whereby features (1-Lipschitz, or
non-expanding, functions) sharply concentrate near their
means and become non-discriminating. This way, the curse
of dimensionality is equated with the phenomenon of concentration
of measure on high-dimensional structures [?],
and can be dealt with in a precise mathematical fashion,
adopting (1) as an axiom.

The intrinsic dimension, $\partial$, is defined for mm-spaces in
an axiomatic way in Section 4, following [18].

To deal with goal (2), we resort to the notion of a distance,
$d_{conc}(X, Y)$, between two mm-spaces, $X$ and $Y$,
measuring their similarity [9]. This forms the subject of Section 3.
Our second axiom says that if two mm-spaces are
close to each other in the above distance, then their intrinsic
dimension values are also close. In this article, we show
that if a dataset $X$ is sampled with regard to a probability
measure $\mu$ on a manifold $M$, then, with high confidence,
the distance between $X$ and $M$ is small, and so $\partial(M)$ and
$\partial(X)$ are close to each other.

The goal (3) can be made into an axiom in a more or less
straightforward way. We give a new example of a dimension
function $\partial$ satisfying our axioms.

We show that the corruption of a low-dimensional manifold $M$
by high-dimensional gaussian noise of moderate amplitude is close
to $M$ in the Gromov distance. However, this property does
not carry over to the samples unless their size is exponential
in the dimension of $R^d$ (an unrealistic assumption), and
thus our approach suffers from high sensitivity to noise
(Section 6). Another drawback is computational complexity:
we show that computing the intrinsic dimension of a
finite sample is an NP-complete problem (Section 5).

However, we believe that the underlying cause of both 
problems is the same: allowing arbitrary non-expanding 
functions as features is clearly too generous. Restricting the 
class of features to that of low-complexity functions whose 
capacity is manageable and rewriting the entire theory in 
this setting opens up a possibility to use statistical learning 
theory and offers a promising way to solve both problems, 
which we discuss in Conclusion. 

2. The phenomenon of concentration of measure 
on high-dimensional structures 

2.1. Spaces with metric and measure 

As in [?], we model datasets within the framework of
spaces with metric and measure (mm-spaces). This is the name
given to a triple $(X, d, \mu)$, consisting of a (finite or infinite)
set $X$, a metric $d$ on $X$, and a probability measure^1 $\mu$
defined on the family $\mathcal{B}$ of all Borel subsets^2 of the
metric space $(X, d)$.

The setting of mm-spaces is natural for at least three
reasons. First, a finite dataset $X$ sitting in a Euclidean
space $R^d$ forms an mm-space in a natural way, as it comes
equipped with a distance and a probability measure (the
empirical measure $\mu_\sharp(A) = \sharp(A)/\sharp(X)$, where $\sharp(A)$ denotes
the number of elements in $A$). Second, if one wants to view
datasets as random samples, then the domain $\Omega$, equipped
with the sampling measure $\mu$ and a distance, also forms an
mm-space. And finally, the theory of mm-spaces is an important
and fast developing part of mathematics, the object of
study of asymptotic geometric analysis, see [?] and
references therein.

^1 That is, a sigma-additive measure of total mass one.
^2 Recall that $\mathcal{B}$ is the smallest family of subsets of $X$ closed under
countable unions and complements and containing every open ball
$B_\varepsilon(x)$, $\varepsilon > 0$, $x \in X$.

Features of a dataset $X$ are functions on $X$ that in some
sense respect the intrinsic structure of $X$. In the presence
of a metric, they are usually understood to be 1-Lipschitz,
or non-expanding, functions $f$, that is, having the property

$$|f(x) - f(y)| \le d(x, y) \quad \text{for all } x, y \in X.$$

We will denote the collection of all real-valued 1-Lipschitz
functions on $X$ by $\mathrm{Lip}_1(X)$.
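As a minimal sketch of the definition above, the following Python snippet checks the 1-Lipschitz inequality on every pair of points of a small finite sample. The sample `points` and the three candidate features are illustrative assumptions, not taken from the paper, and passing the check on a sample only certifies the inequality on that sample, not on the whole space.

```python
import math
from itertools import combinations


def is_one_lipschitz(points, feature, dist=math.dist, tol=1e-12):
    """Check |f(x) - f(y)| <= d(x, y) on every pair of sample points."""
    return all(
        abs(feature(x) - feature(y)) <= dist(x, y) + tol
        for x, y in combinations(points, 2)
    )


# Hypothetical toy sample in the Euclidean plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]


def first_coordinate(p):
    # Coordinate projections are non-expanding.
    return p[0]


def distance_to_origin(p):
    # Distance to a fixed point is non-expanding (triangle inequality).
    return math.dist(p, (0.0, 0.0))


def doubled_coordinate(p):
    # Stretches distances by a factor of 2, so it is not 1-Lipschitz.
    return 2.0 * p[0]
```

Coordinate projections and distance functions are the typical features one has in mind; rescaling a feature by a factor larger than one breaks the property.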

2.2. Curse of dimensionality and observable diameter 

The curse of dimensionality is a name given to the situation
where all or some of the important features of a
dataset sharply concentrate near their median (or mean)
values and thus become non-discriminating. In such cases,
$X$ is perceived as intrinsically high-dimensional. This set
of circumstances covers a whole range of well-known high-dimensional
phenomena such as for instance sparseness of
points (the distance to the nearest neighbour is comparable
to the average distance between two points [2]), etc. It
has been argued in [13] that a mathematical counterpart of
the curse of dimensionality is the well-known concentration
phenomenon [11, 13], which can be expressed, for instance,
using Gromov's concept of the observable diameter [9].

Let $(X, d, \mu)$ be a metric space with measure, and let $\kappa > 0$
be a small fixed threshold value. The observable diameter
of $X$ is the smallest real number, $D = \mathrm{ObsDiam}_\kappa(X)$,
with the following property: for every two points $x, y$, randomly
drawn from $X$ with regard to the measure $\mu$, and for
any given 1-Lipschitz function $f: X \to R$ (a feature), the
probability of the event that values of $f$ at $x$ and $y$ differ
by more than $D$ is below the threshold:

$$P[\,|f(x) - f(y)| > D\,] < \kappa.$$

Informally, the observable diameter $\mathrm{ObsDiam}_\kappa(X)$ is the
size of a dataset $X$ as perceived by us through a series
of randomized measurements using arbitrary features and
continuing until the probability to improve on the previous
observation gets too small. The observable diameter has
little (logarithmic) sensitivity to $\kappa$.

The characteristic size $\mathrm{CharSize}(X)$ of $X$ is defined as the median
value of distances between two elements of $X$. The concentration
of measure phenomenon refers to the observation
that "natural" families of geometric objects $(X_n)$ often satisfy

$$\mathrm{ObsDiam}_\kappa(X_n) \ll \mathrm{CharSize}(X_n) \quad \text{as } n \to \infty.$$

A family of spaces with metric and measure having the
above property is called a Levy family. Here the parameter
$n$ usually corresponds to the dimension of an object defined in
one or another sense.

For the Euclidean spheres $S^n$ of unit radius, equipped
with the usual Euclidean distance and the (unique)
rotation-invariant probability measure, one has, asymptotically
as $n \to \infty$, $\mathrm{CharSize}(S^n) \to \sqrt{2}$, while
$\mathrm{ObsDiam}(S^n) = O(1/\sqrt{n})$. Fig. 1 shows observable diameters
(indicated by inner circles) corresponding to the threshold value
$\kappa = 10^{-10}$ of spheres $S^n$ in dimensions $n = 3, 10, 100, 2500$,
along with projections to the two-dimensional screen of
1000 randomly sampled points.

Fig. 1. Observable diameter of the sphere $S^n$, $n = 3, 10, 100, 2500$.
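The behaviour of the spheres described above can be observed numerically. The sketch below is only an illustration under stated assumptions: it uses a single feature (the first coordinate), so the quantile it reports is merely a lower estimate of the observable diameter, which involves all 1-Lipschitz features. The function names, the threshold `kappa=0.1` and the sample sizes are hypothetical choices, not taken from the paper.

```python
import math
import random
import statistics


def random_point_on_sphere(n, rng):
    """Uniform point on the unit sphere S^n in R^(n+1) (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sphere_statistics(n, pairs=400, kappa=0.1, seed=0):
    """Return (median distance, gap of one feature exceeded with frequency ~kappa)."""
    rng = random.Random(seed)
    distances, feature_gaps = [], []
    for _ in range(pairs):
        x = random_point_on_sphere(n, rng)
        y = random_point_on_sphere(n, rng)
        distances.append(math.dist(x, y))
        feature_gaps.append(abs(x[0] - y[0]))  # feature: first coordinate
    feature_gaps.sort()
    # Empirical (1 - kappa)-quantile of the feature gap.
    index = min(pairs - 1, math.ceil((1.0 - kappa) * pairs))
    return statistics.median(distances), feature_gaps[index]


char_low, gap_low = sphere_statistics(3)
char_high, gap_high = sphere_statistics(500)
```

In high dimension the median distance stays near $\sqrt{2}$ while the spread of the feature shrinks, which is the qualitative picture of Fig. 1.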



Some other important examples of Levy families
[?] include:

• Hamming cubes $\{0, 1\}^n$ of binary $n$-strings equipped
with the normalized Hamming distance $d(\sigma, \tau) = \frac{1}{n}\sharp\{i : \sigma_i \ne \tau_i\}$
and the normalized counting measure. The Law of Large Numbers
is a particular consequence of this fact, hence the name
Geometric Law of Large Numbers sometimes used in place
of Concentration Phenomenon;

• groups $SU(n)$ of special unitary $n \times n$ matrices, with
the geodesic distance and Haar measure (unique invariant
probability measure);

• spaces $R^n$ equipped with the Gaussian measure with
standard deviation $\sigma = 1/\sqrt{n}$;

• any family of expander graphs ([9], p. 197) with the normalized
counting measure on the set of vertices and the
path metric.
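The Hamming cube example from the list above lends itself to a quick numerical check. The following sketch, with hypothetical function names and sample sizes, draws random pairs of binary strings and reports the mean and the standard deviation of their normalized Hamming distance; it illustrates the concentration of one particular quantity and is not a computation of the observable diameter.

```python
import random
import statistics


def normalized_hamming(s, t):
    """Fraction of coordinates in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


def distance_spread(n, pairs=500, seed=0):
    """Mean and standard deviation of distances between random n-bit strings."""
    rng = random.Random(seed)
    distances = []
    for _ in range(pairs):
        s = [rng.randint(0, 1) for _ in range(n)]
        t = [rng.randint(0, 1) for _ in range(n)]
        distances.append(normalized_hamming(s, t))
    return statistics.mean(distances), statistics.pstdev(distances)


mean_small, spread_small = distance_spread(10)
mean_large, spread_large = distance_spread(1000)
```

As $n$ grows, the distances cluster ever more tightly around 1/2, in line with the Law of Large Numbers mentioned in the list.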

Any dataset whose observable diameter is small relative
to the characteristic size suffers from the curse of
dimensionality. For some recent work on this link in the context
of data engineering, cf. [?] and references therein.

2.3. Concentration function and separation distance 

One of many equivalent ways to reformulate the concentration
phenomenon is this:

for a typical "high-dimensional" structure $X$, if $A$ is a
subset containing at least half of all points, then the measure
of the $\varepsilon$-neighbourhood $A_\varepsilon$ of $A$ is overwhelmingly
close to 1 already for small values of $\varepsilon > 0$.

More formally, one can prove that a family $(X_n, d_n, \mu_n)$ of
mm-spaces is a Levy family if and only if, whenever a Borel
subset $A_n \subseteq X_n$ is picked in every $X_n$ in such a way that
$\mu_n(A_n) \ge 1/2$, one has $\mu_n((A_n)_\varepsilon) \to 1$ for every $\varepsilon > 0$. This
reformulation allows one to define the most often used quantitative
measure of the concentration phenomenon, the concentration
function, $\alpha_X(\varepsilon)$, of an mm-space $(X, d, \mu)$, cf. [?].
One sets $\alpha(0) = 1/2$ and for all $\varepsilon > 0$,

$$\alpha(\varepsilon) = 1 - \inf\{\mu(A_\varepsilon) : A \subseteq X,\ \mu(A) \ge 1/2\},$$

where $A$ runs over Borel subsets of $X$. Clearly, a family of
mm-spaces $(X_n)$ is Levy if and only if the concentration
functions $\alpha_{X_n}(\varepsilon)$ converge to zero pointwise for all $\varepsilon > 0$.
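For a very small finite mm-space with the normalized counting measure, the concentration function can be evaluated by exhaustive search. The sketch below makes two assumptions that are not spelled out in the text: the $\varepsilon$-neighbourhood is taken to be open (points at distance strictly less than $\varepsilon$ from the set), and only subsets of the smallest admissible size are enumerated, which suffices because enlarging a set can only enlarge its neighbourhood. The running time is exponential in the number of points.

```python
import math
from itertools import combinations


def concentration_function(points, dist, eps):
    """Brute-force alpha(eps) for a finite space with normalized counting measure."""
    n = len(points)
    if eps == 0:
        return 0.5  # convention alpha(0) = 1/2
    half = math.ceil(n / 2)  # smallest size of a set of measure >= 1/2
    smallest = n
    for subset in combinations(range(n), half):
        # Count the points lying in the open eps-neighbourhood of the subset.
        neighbourhood = sum(
            1
            for i in range(n)
            if any(dist(points[i], points[j]) < eps for j in subset)
        )
        smallest = min(smallest, neighbourhood)
    return 1.0 - smallest / n


# Hypothetical toy space: four equally spaced points on the real line.
line = [0.0, 1.0, 2.0, 3.0]


def line_distance(a, b):
    return abs(a - b)
```

On this toy space the function takes the value 1/2 for small $\varepsilon$, drops to 1/4 once neighbouring points are reached, and vanishes when $\varepsilon$ exceeds the diameter.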

Another such quantitative measure is the separation distance
[9]. Let $\kappa > 0$. The value $\mathrm{sep}_\kappa(X)$ of the $\kappa$-separation
distance of the mm-space $X$ is the supremum of all $\delta$ for
which there are Borel sets $A, B \subseteq X$ at a distance $> \delta$ from
each other which are both sufficiently large:

$$\mu(A) > \kappa, \quad \mu(B) > \kappa.$$

Fig. 2. To the notion of separation distance.

By setting in addition $\mathrm{sep}_0(X) = \mathrm{diam}(X)$, one gets the
separation function of $X$, $\mathrm{sep}(X)$, which is a non-increasing
function from the interval $[0, 1/2]$ to $R$, vanishing at the
right endpoint.

It is a simple exercise to verify that for all $\varepsilon, \kappa > 0$

$$\mathrm{sep}_{\alpha(\varepsilon)}(X) \ge \varepsilon \quad \text{and} \quad \alpha_X(\mathrm{sep}_\kappa(X)) \le \kappa.$$

Thus, a family $(X_n)$ of mm-spaces is a Levy family if and
only if $\mathrm{sep}_\kappa(X_n)$ converge to zero pointwise, cf. Fig. 3.
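The separation distance of a tiny finite metric space with the normalized counting measure can likewise be found by exhaustive search. This is a sketch under assumptions: the measure conditions are read as strict inequalities, so each set needs at least $\lfloor \kappa n \rfloor + 1$ points, and the value returned is the largest gap between two such disjoint sets (for a finite space the supremum coincides with this maximum). The search is exponential in the number of points, so it is only meant for toy examples.

```python
import math
from itertools import combinations


def separation_distance(points, dist, kappa):
    """Brute-force sep_kappa for a finite space with normalized counting measure."""
    n = len(points)
    m = math.floor(kappa * n) + 1  # smallest size of a set of measure > kappa
    if 2 * m > n:
        return 0.0  # two disjoint sets of that size do not fit
    best = 0.0
    indices = range(n)
    for a in combinations(indices, m):
        rest = [i for i in indices if i not in a]
        for b in combinations(rest, m):
            # Distance between the two sets: the smallest pairwise distance.
            gap = min(dist(points[i], points[j]) for i in a for j in b)
            best = max(best, gap)
    return best


# Hypothetical toy space: two clusters of two points on the real line.
clusters = [0.0, 1.0, 4.0, 5.0]


def cluster_distance(a, b):
    return abs(a - b)
```

For small $\kappa$ single points suffice and the value equals the diameter; for $\kappa = 1/4$ each set must contain a whole cluster and the value is the gap between the clusters.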

2.4. Concentration and soft margin classification 

Here we will explain the concentration phenomenon in
the language of soft margin classifiers. We will work in the
setting of [1], Subs. 9.2, assuming that the training dataset
for a binary classification problem is modelled by a sequence
of i.i.d. random variables distributed according to a probability
measure $\nu$ on $Z = \Omega \times \{0, 1\}$. Here $\Omega$ is the domain,
in our case a metric space, and the classifying functions $f$
will be assumed 1-Lipschitz. (For a detailed treatment of the
large margin classification problem for such functions, see [23].)

Fig. 3. Separation functions of the Euclidean spheres $S^n$,
$n = 3, 10, 30, 100$ (horizontal axis: measure $\kappa$).

The margin of a function $f: \Omega \to [0, 1]$ on $(x, y) \in Z$
is defined as


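$$\mathrm{margin}(f(x), y) = \begin{cases} f(x) - 1/2, & \text{if } y = 1, \\ 1/2 - f(x), & \text{otherwise.} \end{cases}$$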



For a $\gamma > 0$ (margin parameter), define the error of $f$ with
respect to $\nu$ and $\gamma$ as the probability

$$\mathrm{er}_\nu^\gamma(f) = \nu\{\mathrm{margin}(f(x), y) < \gamma\}.$$

The value $1 - \mathrm{er}_\nu^\gamma(f)$ is a measure of how many datapoints
admit a confident correct classification.
Theorem 1 Let $(\Omega, d, \mu)$ be a metric space with measure,
and let $\nu$ be a probability distribution on $Z = \Omega \times \{0, 1\}$ with
the marginals equal to $\mu$ on $\Omega$ and the Bernoulli distribution
on $\{0, 1\}$. Let $\gamma > 0$. Then for every 1-Lipschitz function $f$,

$$\mathrm{er}_\nu^\gamma(f) \ge 1 - 2\alpha(\gamma),$$

where $\alpha$ denotes the concentration function of the domain
$(\Omega, d, \mu)$.

The result easily follows from the definition of the concentration
function if one takes into account that the distribution
$\nu$ induces a partition of $\Omega$ into two Borel subsets of
measure 1/2 each. Conversely, one can bound the concentration
function in terms of the uniform error:

$$\alpha(\gamma) \le 1 - \sup_f \mathrm{er}_\nu^\gamma(f),$$

where the supremum is taken over all 1-Lipschitz functions
on $\Omega$.

This formalizes the observation that in datasets suffering
from the curse of dimensionality, large margin classification with
1-Lipschitz functions becomes impossible.



3. Gromov's distance and concentration to a 
non-trivial space 

3.1. Definition 

Gromov's distance between two mm-spaces satisfies the
usual axioms of a metric and is introduced in such a way
that a family of mm-spaces forms a Levy family if and only if it
converges to a one-point space with regard to Gromov's distance.
Thus, one can say that a dataset $X$ suffers from the
curse of dimensionality if it is close to a one-point space in
Gromov's distance. Intuitively, it means that the features of
$X$ give away as little useful information about the intrinsic
structure of $X$ as the features of the trivial one-point set,
that is, not much more can be derived about $X$ from observations
than about a one-point set. (For a formalization
of this discussion in terms of Gromov's observable diameter
of an mm-space, see [3] and Gromov's original book [9].)

Gromov's distance allows one to talk of concentration to 
a non-trivial space. In a sense, this is what happens in the 
context of principal manifold analysis, where one expects a 
dataset to concentrate to a low-dimensional manifold. 

Let $X = (X, d_X, \mu_X)$ and $Y = (Y, d_Y, \mu_Y)$ be two mm-spaces.
The idea of Gromov's distance is that $X$ and $Y$ are
close if every feature of $X$ can be matched against a similar
feature of $Y$, and vice versa. For this purpose, one needs to
represent all the features as functions on a common third
space. This is achieved through a standard result in measure
theory. Every mm-space $X$ can be parametrized by the unit
interval: there is a measurable map $\phi: [0, 1] \to X$ with the
property that whenever $A \subseteq X$ is a Borel subset, one has

$$\mu_X(A) = \lambda(\phi^{-1}(A)),$$

where $\lambda$ is the Lebesgue measure on $[0, 1]$ and $\phi^{-1}(A) =
\{t \in [0, 1] : \phi(t) \in A\}$ is the inverse image of $A$ under $\phi$.

Introduce the distance $me_1$ between measurable functions
on $[0, 1]$ as follows:

$$me_1(f, g) = \inf\{\varepsilon > 0 : \lambda\{t \in [0, 1] : |f(t) - g(t)| > \varepsilon\} < \varepsilon\}.$$

This is indeed a metric, determining the well-known convergence
in measure.
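The distance $me_1$ can be evaluated exactly for functions that are constant on the cells of a uniform grid of $[0, 1]$. The sketch below assumes that both functions are given as lists of values on $N$ equal cells (an illustrative discretization, not part of the text). Writing the absolute differences in decreasing order, $d_{(1)} \ge \dots \ge d_{(N)}$, with $d_{(N+1)} = 0$, the infimum in the definition equals the minimum over $k = 0, \dots, N$ of $\max(d_{(k+1)}, k/N)$: for $\varepsilon$ above $d_{(k+1)}$ at most $k$ cells exceed $\varepsilon$, and their total measure is $k/N$.

```python
def me1_distance(f_values, g_values):
    """me_1 distance between two step functions on a uniform grid of [0, 1]."""
    diffs = sorted(
        (abs(a - b) for a, b in zip(f_values, g_values)), reverse=True
    )
    n = len(diffs)
    diffs.append(0.0)  # sentinel playing the role of d_(N+1) = 0
    # diffs[k] is d_(k+1); k / n is the measure of the k largest cells.
    return min(max(diffs[k], k / n) for k in range(n + 1))


# Hypothetical examples on a grid of 10 cells.
zero = [0.0] * 10
ones = [1.0] * 10
bump = [1.0, 1.0] + [0.0] * 8  # differs from zero on 20% of the interval
```

A function that differs from another by 1 on a set of measure 0.2 is at $me_1$-distance 0.2 from it, which shows how the metric trades the size of a deviation against the measure of the set on which it occurs.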

Now define the Gromov distance $d_{conc}(X, Y)$ between
two mm-spaces $X$ and $Y$ as the infimum of all $\varepsilon > 0$ for
which there exist some suitable parametrizations $\phi_X$ and
$\phi_Y$ of $X$ and of $Y$ respectively, with the following property.
For every $f \in \mathrm{Lip}_1(X)$ there is a $g \in \mathrm{Lip}_1(Y)$ with

$$me_1(f \circ \phi_X, g \circ \phi_Y) < \varepsilon, \qquad (1)$$

and vice versa: for every $g \in \mathrm{Lip}_1(Y)$ there is an $f \in
\mathrm{Lip}_1(X)$ satisfying Eq. (1).

Proposition 2 Let $X$ be an mm-space. Then

$$(d_{conc}(X, \{*\}) < \varepsilon/2) \Rightarrow (\alpha(\varepsilon) < \varepsilon/2) \Rightarrow (d_{conc}(X, \{*\}) < \varepsilon).$$

PROOF. Suppose $d_{conc}(X, \{*\}) < \varepsilon/2 < 1/2$ and let $A \subseteq X$,
$\mu(A) \ge 1/2$. The distance function $d_A(x) = d(A, x) =
\inf\{d(a, x) : a \in A\}$ is 1-Lipschitz and so differs from a suitable
constant function $c$ by less than $\varepsilon/2$ on a set of measure
$> 1 - \varepsilon/2$. Clearly, $c < \varepsilon/2$, and so $d_A$ can possibly take
value $> \varepsilon$ on a set of measure $< \varepsilon/2$, meaning $\alpha(\varepsilon) < \varepsilon/2$.
Conversely, if $d_{conc}(X, \{*\}) > \varepsilon$, there exists a 1-Lipschitz
function $f$ on $X$ which differs from its median value $M =
M_f$ by at least $\varepsilon$ on a set of measure $> \varepsilon$. It means the existence
of two sets, $A$ and $B$, such that $\mu(A) \ge 1/2$, $\mu(B) >
\varepsilon/2$, and for all $a \in A$, $b \in B$ one has $|f(a) - f(b)| > \varepsilon$, that
is, $\alpha(\varepsilon) > \varepsilon/2$.

E.g. for the spheres the Gromov distance to a point is
exactly the solution to the equation $\alpha_{S^n}(\varepsilon) = \varepsilon/2$, cf. Fig. 4.

Fig. 4. Concentration functions of spheres $S^n$, $n = 3, 10, 30, 100$, and
the straight line $\alpha = \varepsilon/2$ (horizontal axis: Euclidean distance $\varepsilon$).



Corollary 3 A family $(X_n)$ of mm-spaces is a Levy family
if and only if it converges with regard to Gromov's distance
to the trivial one-point space $\{*\}$.

Remark 4 Notice that in the definition of Gromov's
distance $d_{conc}(X, Y)$ one can replace throughout the sets
$\mathrm{Lip}_1(X)$ of all 1-Lipschitz functions with the sets of all
1-Lipschitz functions $f$ satisfying $\|f\|_\infty \le D$, where $D$ is
an upper bound on the diameters of the two metric spaces in
question and $\|f\|_\infty$ is the supremum norm of $f$.

3.2. Gromov and Monge-Kantorovich distances 

Let us compare the Gromov distance to the well-known
Monge-Kantorovich, or mass transportation, distance (also
known in computer science as the Wasserstein, or Earth-mover's
distance), see [20]. Given two probability measures
$\mu$ and $\nu$ on a metric space $(X, d)$, the mass transportation
distance between them is

$$d_{mass}(\mu, \nu) = \inf_\eta \int_{X \times X} d(x, y)\, d\eta(x, y),$$

where $\eta$ runs over all probability measures on $X \times X$ whose
marginals are $\mu$ and $\nu$, respectively. Thinking of $\mu$ and $\nu$
as piles of sand of equal mass, $d_{mass}(\mu, \nu)$ is the smallest
average distance that a grain of sand has to travel when
the first pile is moved to take place of the second.
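In one special case the mass transportation distance has a closed form that is easy to compute: for two empirical measures on the real line with the same number of equally weighted atoms, an optimal plan matches the sorted points in order. The sketch below relies on this standard fact about transport on the line; the function name and the restriction to equal sample sizes are illustrative assumptions.

```python
def mass_transport_1d(xs, ys):
    """Mass transportation distance between two equal-size samples on the line.

    Both samples carry the normalized counting measure; matching the
    sorted points in order is optimal for the cost |x - y| on the line.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)
```

For instance, shifting every grain of a pile by one unit costs exactly one unit on average, while moving one grain out of three by one unit costs a third.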

Proposition 5 $d_{conc}((X, d, \mu), (X, d, \nu)) \le \sqrt{d_{mass}(\mu, \nu)}$.

PROOF. Without loss of generality, one can assume both
$\mu$ and $\nu$ to be non-atomic, and use the coordinate projections
$\pi_i$, $i = 1, 2$, from the measure space $(X \times X, \eta)$ to
parametrize the two mm-spaces in question. For every $f \in
\mathrm{Lip}_1(X)$ the $L^1$-norm of the difference $f \circ \pi_1 - f \circ \pi_2$ satisfies

$$\int_{X \times X} |f(x) - f(y)|\, d\eta(x, y) \le \int_{X \times X} d(x, y)\, d\eta = d_{mass}(\mu, \nu),$$

whence the desired estimate follows easily.

No bound in the opposite direction is possible. For instance,
$d_{conc}(S^n, \{*\}) = o(1)$, while the mass transportation
distance between the Haar measure on the sphere $S^n$
and any Dirac point mass will be at least 1.

3.3. Sampling 

If $(\Omega, d, \mu)$ is an mm-space and $X$ is a $\mu$-sample of $\Omega$,
then $X$ becomes an mm-space in its own right if equipped
with the restriction of the distance $d$ and the normalized
counting measure. The following theorem states that random
samples of an mm-space $\Omega$ will concentrate to it with
confidence approaching one as the sample size increases.

Recall that a metric space $\Omega = (\Omega, d)$ is totally bounded
if for every $u > 0$ it can be covered with finitely many
open balls of radius $u$, and the smallest such number, the
covering number, is denoted $N(\Omega, u)$. For instance, every
compact metric space is totally bounded.

Theorem 6 Let $(\Omega, d, \mu)$ be a totally bounded metric space
of diameter one equipped with a non-atomic Borel probability
measure $\mu$. Let $\varepsilon, \delta > 0$, and let $X$ be a random $\mu$-sample
of $\Omega$, of size




+ l\ du 



C 1,2 

n> — max \ log ■ 



where $C > 0$ is an absolute constant.
Then with confidence $> 1 - \delta$ one has:

$$d_{conc}(\Omega, X) < \varepsilon.$$

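PROOF. The Rademacher averages of a class $\mathcal{F}$ of functions
are capacity measures defined as follows: for every
$n = 1, 2, 3, \ldots$,

$$R_n(\mathcal{F}) = E_\mu E_\epsilon \left( \frac{1}{n} \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \epsilon_i f(X_i) \right| \right), \qquad (2)$$

where $X_i$, $i = 1, 2, \ldots, n$ are i.i.d. sample points according
to the sample distribution $\mu$, and $\epsilon_i$ are i.i.d. Rademacher
random variables assuming equiprobable values $\pm 1$.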
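To make the notion of a Rademacher average concrete, here is a Monte Carlo sketch for a finite class of functions on a fixed sample. It rests on several assumptions: the class is a finite list of Python callables, the sample is held fixed (so only the expectation over the random signs is estimated), and the normalization is by $1/n$ with an absolute value inside the supremum. All names are hypothetical.

```python
import random


def empirical_rademacher(functions, sample, trials=2000, seed=0):
    """Monte Carlo estimate of the Rademacher average on a fixed sample."""
    rng = random.Random(seed)
    n = len(sample)
    # Precompute the value of every function at every sample point.
    values = [[f(x) for x in sample] for f in functions]
    total = 0.0
    for _ in range(trials):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(
            abs(sum(s * v for s, v in zip(signs, row))) for row in values
        ) / n
    return total / trials


# Hypothetical sample of 100 points in [0, 1] and two simple features.
sample = [i / 99 for i in range(100)]


def constant_one(x):
    return 1.0


def identity(x):
    return x


single = empirical_rademacher([constant_one], sample)
double = empirical_rademacher([constant_one, identity], sample)
```

For the class consisting of the single constant function 1 the estimate is close to $\sqrt{2/(\pi n)}$, and enlarging the class can only increase the average, which is why it serves as a capacity measure.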
Making use of Remark 4, denote by $\mathcal{F}_\Omega$ the space of all
1-Lipschitz functions on $\Omega$ with $\|f\|_\infty \le 1$, and similarly for
$\mathcal{F}_X$. By Theorem 18 in [?], for a suitable constant $C > 0$,



i?„(^o)<2e+^/jA^(f7,5)log(l + l 



du. (3) 



Since the diameter and covering numbers of $X$ are majorized
by those of $\Omega$, the inequality (3) remains true if $\mathcal{F}_\Omega$
is replaced with $\mathcal{F}_X$.

Because $(\Omega, \mu)$ is a standard Borel non-atomic probability
space, it can be used instead of the unit interval to
parametrize $X$ (with the normalized counting measure).
Choose a Borel measurable parametrization $\phi: \Omega \to X$ with
the property $\phi(x) = x$ for each $x \in X$. Let $\phi^*\mathcal{F}_X$ denote
the set of all pull-back functions on $\Omega$ of the form $g = f \circ \phi$,
$f \in \mathcal{F}_X$. Such functions are Borel measurable, though not
necessarily Lipschitz. By the choice of $\phi$, the Rademacher
averages of $\mathcal{F}_X$ and of $\phi^*\mathcal{F}_X$ coincide, and so Eq. (3)
continues to hold with $\phi^*\mathcal{F}_X$ in place of $\mathcal{F}_\Omega$.

Corollary 3 on p. 19 in [?], applied to the function class
$\mathcal{F} = \mathcal{F}_\Omega$, together with the inequality (3), implies that,
under the condition



C 1,1 

n> max < log - 




7V(l],plogf - + l]du 



u 



(4) 



one has with confidence $> 1 - \delta$ that the empirical mean
and the expected value of each $f \in \mathcal{F} = \mathcal{F}_\Omega$ differ by less
than $\varepsilon$:

$$\sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - E_\mu f \right| < \varepsilon.$$

(Notice that in Eq. (2) we use the normalization by $1/n$ as
e.g. in [?], while the normalization in [14] is by $1/\sqrt{n}$. Also,
the constant $C$ in Eq. (4) is different from that in Eq. (3).)

An analogous statement is true of the class $\mathcal{F} = \phi^*\mathcal{F}_X$.
Consequently, under the assumption (4), with confidence
$> 1 - 2\delta$, if $f \in \mathcal{F}_\Omega$ and $g \in \phi^*\mathcal{F}_X$ coincide on $X$, then
$\|f - g\|_{L^1(\mu)} < 2\varepsilon$, which, in its turn, easily implies
$me_1(f, g) < \sqrt{2\varepsilon}$.

For every function $f \in \mathcal{F}_\Omega$ there is a function from the
class $\phi^*\mathcal{F}_X$ taking the same values as $f$ at all points of $X$:
this is the function $(f|_X) \circ \phi$. The converse is also true: as is
well known, every 1-Lipschitz function on a subspace of a
metric space (e.g. $X$) admits an extension to a 1-Lipschitz
function on the entire space (in our case, $\Omega$), cf. e.g. Lemma
7 in [?]. If $g \in \mathcal{F}_X$, there exists a $\tilde{g} \in \mathcal{F}_\Omega$ extending $g$,
and now $(g \circ \phi)|_X = \tilde{g}|_X$. We conclude: with confidence
$> 1 - 2\delta$, the Hausdorff distance between $\mathcal{F}_\Omega$ and the
pull-back of $\mathcal{F}_X$ to $\Omega$ is bounded by $\sqrt{2\varepsilon}$. Therefore,

$$d_{conc}(\Omega, X) < \sqrt{2\varepsilon}$$

with confidence $> 1 - 2\delta$. Making a substitution
$\varepsilon_{new} = \sqrt{2\varepsilon}$, $\delta_{new} = 2\delta$, we obtain the desired result.



Since the above result is meant to be applied to low- 
dimensional manifolds, the values of the covering numbers 
are relatively low, and the theorem gives meaningful esti- 
mates for realistically sized sample sets. For e = 0.1 and 
S = 10~^ they are on the order of thousands of points for 
$d = 1$ (principal curves), tens of thousands for $d = 2$ and
millions for $d = 3$. The estimates can no doubt be significantly
improved.



4. An axiomatic approach to inner dimension 
4.1. Axioms 

Let $\mathcal{M}$ denote some class of spaces with metric and
measure (possibly including all of them), containing a
family $(X_n)$ of spaces asymptotically approaching the
$n$-dimensional unit Euclidean spheres $S^n$ with their standard
rotation-invariant probability measures:

$$d_{conc}(X_n, S^n) \to 0.$$

These can be Euclidean cubes $[0, 1]^n$ with the Lebesgue
measure and the distance normalized by $1/\sqrt{n}$, the Hamming
cubes $\{0, 1\}^n$ with the normalized Hamming ($\ell^1$) distance
and normalized counting measure, etc.

Let $\partial$ be a function defined for every member of $\mathcal{M}$ and
assuming values in $[0, \infty) \cup \{\infty\}$. We call $\partial$ an intrinsic
dimension function if it satisfies the following axioms:

(i) (Axiom of concentration) A family $(X_n)$ of members
of $\mathcal{M}$ is a Levy family if and only if $\partial(X_n) \uparrow \infty$.

(ii) (Axiom of smooth dependence on datasets) If $X_n, X \in
\mathcal{M}$ and $d_{conc}(X_n, X) \to 0$, then $\partial(X_n) \to \partial(X)$.

(iii) (Axiom of normalization ^) For some (hence every)
family $(X_n) \subset \mathcal{M}$ with the property $d_{conc}(X_n, S^n) =
o(1)$ one has $\partial(X_n) = \Theta(n)$.

The first axiom formalizes a requirement that the intrinsic
dimension is high if and only if a dataset suffers from
the curse of dimensionality. The second axiom assures that
a dataset $X$ well-approximated by a non-linear manifold
$M$ has an intrinsic dimension close to that of $M$. The role
of the third axiom is just to calibrate the values of the intrinsic
dimension.

As explained in [?], the axioms lead to a paradoxical
conclusion: every dimension function defined for all mm-spaces
must assign to the trivial one-point space $\{*\}$ the
value $+\infty$. This paradox is harmless and does not lead to
any contradictions. Furthermore, one can avoid it by restricting
the class $\mathcal{M}$ to mm-spaces of a given characteristic
size (i.e., the median value of distances between two
points), which does not lead to any real loss in generality.



^ Here recall that $f(n) = \Theta(g(n))$ if there exist constants $0 < c < C$
and an $N$ with $c|f(n)| \le |g(n)| \le C|f(n)|$ for all $n \ge N$. One says
that the functions $f$ and $g$ asymptotically have the same order of
magnitude.

4.2. Dimension function based on separation distance

In [3] we gave an example of a dimension function, the
concentration dimension of $X$:

$$\dim_\alpha(X) = \frac{1}{\left[ 2 \int_0^1 \alpha_X(\varepsilon)\, d\varepsilon \right]^2}. \qquad (5)$$



Here is another dimension function.
Example 7 The quantity

$$\partial_{sep}(X) = \frac{1}{\left[ 2 \int_0^1 \mathrm{sep}_\kappa(X)\, d\kappa \right]^2} \qquad (6)$$

defines an intrinsic dimension function on the class of all
mm-spaces $X$ for which the above integral is proper (including,
in particular, all spaces of bounded diameter). We call
it the separation dimension. (Cf. Fig. 5.)



Fig. 5. Separation dimension $\partial_{sep}$ of the Hamming cube $\{0, 1\}^d$,
equipped with the normalized Hamming distance and normalized
counting measure, $11 \le d \le 169$, $d$ odd, plotted against the
straight line $y = x$.

By judiciously choosing a normalizing constant, one can
no doubt make the separation dimension of $\{0, 1\}^n$ fit the
values of $n$ much closer.

In fact, practically every concentration invariant from
the theory of mm-spaces leads to an example of an intrinsic
dimension function, and Chapter 3½ of [9] is a particularly
rich source of such invariants.



4.3. Dimension and sampling 

Most existing approaches to intrinsic dimension of a
dataset have to confront the problem that, strictly speaking,
the value of dimension of a finite dataset is zero,
because it is a discrete object. On the contrary, as exemplified
by the Hamming cube $\{0, 1\}^n$ (Fig. 5), our dimension
functions make no distinction between discrete and continuous
mm-spaces. Moreover, the dimension of randomly
sampled finite subsets approaches the dimension of the
domain. The following is a consequence of Theorem 6 and
Axiom 2 of dimension function.

Corollary 8 Let $\partial$ be a dimension function, and let $\Omega =
(\Omega, d, \mu)$ be a non-atomic mm-space. For every $\varepsilon > 0$, $\delta > 0$
there is a value $n_0 = n_0(\partial, \Omega, \varepsilon, \delta)$ such that, whenever $X$
is a set of cardinality $\ge n_0$ randomly sampled from $\Omega$ with
regard to the measure $\mu$, one has with confidence $> 1 - \delta$

$$|\partial(\Omega) - \partial(X)| < \varepsilon.$$

Jointly with Theorems 1 and 6, the above Corollary im- 
plies the following result which we state in a qualitative 
version. 

Corollary 9 Let $\partial$ be a dimension function, and let
$(\Omega, d, \mu)$ be a non-atomic metric space with measure. Let
$\nu$ be a probability distribution on $Z = \Omega \times \{0, 1\}$ with the
marginals equal to $\mu$ on $\Omega$ and the Bernoulli distribution
on $\{0, 1\}$. Then for every $\gamma, \delta, \varepsilon > 0$ there are natural
numbers $n_0$ and $d_0$ with the following property. Assume
$\partial(\Omega, d, \mu) \ge d_0$. Let $n \ge n_0$ training datapoints be sampled
from $Z$ according to the distribution $\nu$ in an i.i.d. fashion.
Then with confidence $1 - \delta$, for every 1-Lipschitz function
$f$ the empirical error satisfies



er; 



where $\nu_n$ is the empirical measure supported on the sample.

In other words, an intrinsically high-dimensional dataset 
does not admit large margin classifiers. 

5. Complexity 

For the moment, we don't have any example of a dimension
function that would be computationally feasible other
than for well-understood geometrical objects (spheres,
cubes...).

Theorem 10 Fix a value $0 < \kappa < 1/2$. Determining the
value $\mathrm{sep}_\kappa(X)$ of the separation function for finite metric
spaces $X$ (with the normalized counting measure) is an
NP-complete problem.

PROOF. To a given finite metric space $X$ and a threshold
$\delta > 0$ associate a graph with $X$ as the vertex set and two
vertices $x, y$ being adjacent if and only if $d(x, y) \ge \delta$. Now the
problem of deciding whether $\mathrm{sep}_\kappa(X) \ge \delta$ is equivalent to
solving the largest balanced complete bipartite subgraph problem
for this graph, which is known to be NP-complete, cf. GT24 in [?].

6. High-dimensional noise 

Another deficiency of our approach in its present form is
its sensitivity to noise. We will consider an idealized situation
where data is corrupted by high-dimensional Gaussian
noise, as follows. Let $\mu$ be a probability measure on the Euclidean
space $R^d$. Assume that $\mu$ is supported on a compact
submanifold $M$ of $R^d$ of lower dimension $m \ll d$. If $\mu$ has
density $p$ (that is, is absolutely continuous with regard to
the Lebesgue measure), a dataset $X$ being sampled in the
presence of gaussian noise means that $X$ is sampled with
regard to the density

$$p * g_\sigma,$$

where

$$g_\sigma(x) = (2\pi\sigma^2)^{-d/2} \exp\left( -\frac{\|x\|^2}{2\sigma^2} \right)$$

is the density of the gaussian distribution $\gamma_d(0, \sigma^2)$. Equivalently,
$X$ is sampled with regard to the convolution of $\mu$
with the $d$-dimensional Gaussian measure:

$$\mu * \gamma_d,$$

in which form the assumption of absolute continuity of
$\mu$ becomes superfluous. One can think of the mm-space
$(R^d, \mu * \gamma_d)$ with the Euclidean distance as a corruption
of the original domain $(M, \mu)$. We will further assume that
the amplitude of the corrupting noise is on the same order
of magnitude as the size of $M$, that is, $E_{\gamma_d}(\|x\|) = O(1)$,
or $\sigma = O(1/\sqrt{d})$.
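The scaling of the noise can be checked with a short simulation. The sketch below assumes that $\gamma_d(0, \sigma^2)$ denotes the Gaussian measure on $R^d$ with independent coordinates of variance $\sigma^2$, so that the expected norm is approximately $\sigma\sqrt{d}$; the function name and the sample sizes are hypothetical.

```python
import math
import random
import statistics


def mean_noise_norm(d, sigma, samples=200, seed=0):
    """Average Euclidean norm of d-dimensional Gaussian noise N(0, sigma^2 I)."""
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d)))
        for _ in range(samples)
    ]
    return statistics.mean(norms)


# Noise of moderate amplitude: sigma shrinks like 1 / sqrt(d).
moderate = mean_noise_norm(1000, 1.0 / math.sqrt(1000))
# Fixed sigma: the amplitude grows like sqrt(d) instead.
large = mean_noise_norm(100, 1.0)
```

With $\sigma = 1/\sqrt{d}$ the typical displacement of a data point is about 1 regardless of $d$, which is the regime of "moderate amplitude" considered here.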

Here is a result in the positive direction. 
Theorem 11 Let M be a compact topological manifold
supporting a probability measure $\mu$. Consider a family of
embeddings of M into the Euclidean space $\mathbb{R}^d$, $d \to \infty$, as a
submanifold in such a way that the Euclidean covering
numbers $N(M, \varepsilon)$, $\varepsilon > 0$, grow as $o(d)$. Let $(M, \mu)$ be corrupted
by the gaussian noise $\gamma_d(0, \sigma^2)$ of constant amplitude, that
is, $\sigma = O(1/\sqrt{d})$. Then the Gromov distance between the
image of $(M, \mu)$ in $\mathbb{R}^d$ and its corruption by $\gamma_d(0, \sigma^2)$ tends
to zero as $d \to \infty$.

PROOF. For an $\varepsilon > 0$, let F be a finite $\varepsilon$-net for M.
Denote by $\pi$ the orthogonal projection from $\mathbb{R}^d$ to the linear
subspace V spanned by F. Let $\pi_*\mu$ denote the push-forward
of the measure $\mu$ to V, that is, for every Borel $A \subseteq V$ one
has $(\pi_*\mu)(A) = \mu(\pi^{-1}(A))$.

The mass transportation distance between $\mu$ and $\pi_*\mu$ is
bounded by $\varepsilon$, and by Proposition 5 the Gromov distance
between M and $(V, \pi_*\mu)$ is bounded by $\varepsilon^{1/2}$. A similar
argument gives the same upper bound for the Gromov distance
between the gaussian corruption of M and that of $(V, \pi_*\mu)$.

The mm-space $(\mathbb{R}^d, (\pi_*\mu) * \gamma_d)$ can be parametrized
by the identity mapping of itself (because the measure
is non-atomic and has full support), while the projection
$\pi$ parametrizes the space $(V, \pi_*\mu)$ by its very definition.
If $f \in \mathrm{Lip}_1(V)$, then $f \circ \pi \in \mathrm{Lip}_1(\mathbb{R}^d)$. Conversely, let
$f \in \mathrm{Lip}_1(\mathbb{R}^d)$. The fibers $\pi^{-1}(x)$, $x \in V$, are
$(d - |F|)$-dimensional affine subspaces, and the measure induced on
each fiber by the measure $(\pi_*\mu) * \gamma_d$ approaches the
gaussian measure $\gamma_{d-|F|}(0, \sigma^2)$ with regard to the mass
transportation distance as $d \to \infty$. The function $\tilde{f}$ obtained
from f by integration over all fibers $\pi^{-1}(x)$, $x \in V$,
belongs to $\mathrm{Lip}_1(V)$, and since $d - |F| = O(d)$, the
concentration of measure for gaussians (p. 140 in [16]) implies
that for some absolute constant C, the functions $\tilde{f} \circ \pi$ and
f differ by less than $\varepsilon$ on a set of $(\pi_*\mu) * \gamma_d$-measure
$> 1 - C \exp(-C\varepsilon^2/\sigma^2)$. In particular, if d is large enough,
the Gromov distance between M and its gaussian corruption
will not exceed $3\sqrt{\varepsilon}$, whence the result follows since
$\varepsilon > 0$ was arbitrary.
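
The concentration step of the proof can be illustrated numerically, though the sketch below is not a substitute for the argument. It takes one particular 1-Lipschitz feature, the Euclidean norm (our choice, not one singled out in the proof), evaluates it on gaussian perturbations of a fixed base point with $\sigma = 1/\sqrt{d}$, and compares the spread of the values with $\sigma$. The base point, the number of trials and the seed are hypothetical.

```python
import math
import random
import statistics

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def spread_of_lipschitz_feature(d, trials=300, seed=1):
    """Spread of f(y) = ||y|| over gaussian perturbations of a fixed point."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)
    base = [1.0] + [0.0] * (d - 1)  # hypothetical base point of norm 1
    values = []
    for _ in range(trials):
        noisy = [b + rng.gauss(0.0, sigma) for b in base]
        values.append(norm(noisy))  # the norm is a 1-Lipschitz function
    return statistics.pstdev(values), sigma

spread, sigma = spread_of_lipschitz_feature(400)
print(spread, sigma)
```

Although the noise itself has norm of order 1, the values of the feature fluctuate only on the scale of $\sigma$, which is the phenomenon the proof exploits fiber by fiber.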

Corollary 12 Under the assumptions of Theorem 11, the
value of any dimension function $\partial$ for the corruption of M
converges to $\partial(M)$ as $d \to \infty$.

Unfortunately, this result does not extend to finite sam- 
ples, because the required size of a random sample of M 
in the presence of noise is unrealistically high: the covering 
numbers of $(\mathbb{R}^d, \mu * \gamma_d)$ go to infinity exponentially fast (in
d), and Theorem 6 becomes useless. 

As an illustration, consider the simplest case possible. 
Proposition 13 Let $M = \{0\}$ be a singular one-point
manifold, and let X be sampled from M in the presence of
gaussian random noise of moderate amplitude, that is,

\[ X \sim N(0, \sigma^2), \]

where $\sigma^2 = 1/d$. Assume the cardinality of the sample X to
be constant, $|X| = O(1)$. Then the Gromov distance between
$M = \{0\}$ and X tends to a positive constant ($\sqrt{2}/3 \approx 0.47$)
as $d \to \infty$.

PROOF. It is a well-known manifestation of the curse
of dimensionality that, as $d \to \infty$, the distances between
pairs of points of X strongly concentrate near the median
value, which in this case will tend to $\sqrt{2}$. Thus, a typical
random sample X will form, for all practical purposes,
a discrete metric space of diameter $\approx \sqrt{2}$. In particular,
$\mathrm{Lip}_1(X)$ will contain numerous 1-Lipschitz functions that
are highly non-constant, and the Gromov distance from X
to the one-point space M is seen to tend to the value $\sqrt{2}/3$.
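
The concentration of pairwise distances near $\sqrt{2}$ is easy to observe in a simulation. The Python sketch below samples a small cloud from the gaussian distribution with $\sigma^2 = 1/d$ and lists the extreme pairwise distances. The sample size, the dimension and the seed are hypothetical values chosen to keep the run short.

```python
import math
import random
from itertools import combinations

def sample_noise_cloud(size, d, seed=0):
    """Points drawn from N(0, sigma^2) in dimension d with sigma^2 = 1/d."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(size)]

def pairwise_distances(points):
    return [math.dist(p, q) for p, q in combinations(points, 2)]

distances = pairwise_distances(sample_noise_cloud(size=30, d=2000))
# Smallest and largest distance, next to the limiting value sqrt(2).
print(min(distances), max(distances), math.sqrt(2))
```

The squared distance between two independent points has expectation $2d\sigma^2 = 2$, which is where the limiting value $\sqrt{2}$ comes from.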

For manageable sample sizes (up to millions of points) the 
above will already happen in moderate to high dimensions. 
Example 14 For d = 50, a random sample X as above
of s = 10^ points will contain, with confidence > 0.99, a 
1-separated subset S containing > 95% of all points (that
is, every two points of S are at a distance > 1 from each 
other). Consequently, the value of the separation function of X
at the corresponding level is > 1, and the separation
dimension $\partial_{sep}(X)$ will not exceed 1.125. (At the same time,
$\partial_{sep}(\{0\}) = +\infty$.)
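
An experiment in the spirit of this example can be sketched as follows. The code draws a sample in dimension d = 50 with $\sigma^2 = 1/d$ and greedily extracts a subset in which all pairwise distances exceed 1. The sample size of 200 is a hypothetical, deliberately small value and is not the one used in the example. The greedy procedure is our choice and only gives a lower bound on the size of the largest 1-separated subset.

```python
import math
import random

def greedy_separated_subset(points, gap=1.0):
    """Keep a point only if it is farther than `gap` from all kept points."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > gap for q in chosen):
            chosen.append(p)
    return chosen

rng = random.Random(0)
d, size = 50, 200  # size is a hypothetical, small sample size
sigma = 1.0 / math.sqrt(d)
sample = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(size)]
subset = greedy_separated_subset(sample)
fraction = len(subset) / size
print(fraction)
```

The printed fraction is the share of the sample retained in the separated subset for this particular seed.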

We conclude: the proposed intrinsic dimension of discrete 
datasets of realistic size is unstable under random high- 
dimensional noise of moderate amplitude. 

7. Comparison to other approaches 

7.1. The intrinsic dimensionality of Chavez et al. 

The following interesting version of intrinsic dimension 
was proposed by Chavez et al. [5], who called it simply
intrinsic dimensionality. Let $(X, d, \mu)$ be a space with
metric and measure. Denote by $m(d)$ the mean of the distance
function $d \colon X \times X \to \mathbb{R}$ on the space $X \times X$ with the
product measure. Assume $m(d) < \infty$. Let $\sigma(d)$ be the standard
deviation of the same function. The intrinsic dimensionality
of X is defined as

\[ \dim_{dist}(X) = \frac{m(d)^2}{2\sigma^2(d)}. \qquad (7) \]

Theorem 15 The intrinsic dimensionality satisfies: 

- a weaker version of Axiom 1: if $(X_n, d_n, \mu_n)$ is a
Levy family of spaces with bounded metrics, then
$\dim_{dist}(X_n) \to \infty$,

- a weaker version of Axiom 2: if $d_{conc}(X_n, X) \to 0$ and
$m(d_n) \to m(d)$, then $\dim_{dist}(X_n) \to \dim_{dist}(X)$,

- Axiom 3. 

For a proof, as well as a more detailed discussion, see [18],
where in particular it is shown on a number of examples
that the dimension of Chavez et al. and our dimension can
behave in quite different ways from each other (and, of
course, from the topological dimension).
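
In contrast with the separation function, the intrinsic dimensionality of formula (7) is straightforward to estimate from a sample. The Python sketch below computes the squared mean of the pairwise distances divided by twice their variance. It uses distinct pairs of points only, a finite-sample convention adopted here that differs slightly from the full product measure in the definition. The function names and the gaussian test data are hypothetical.

```python
import math
import random
import statistics
from itertools import combinations

def intrinsic_dimensionality(points):
    """Squared mean of pairwise distances over twice their variance."""
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    mean = statistics.fmean(distances)
    variance = statistics.pvariance(distances, mu=mean)
    return mean * mean / (2.0 * variance)

def gaussian_sample(size, d, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(size)]

low = intrinsic_dimensionality(gaussian_sample(80, 2))
high = intrinsic_dimensionality(gaussian_sample(80, 20))
print(low, high)
```

The cost is quadratic in the sample size, and the estimate grows with the ambient dimension of the gaussian sample, in line with the weaker version of Axiom 1.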

7.2. Some other approaches 

The approaches to intrinsic dimension listed below are 
all quite different both from our approach and from that 
of Chavez et al., in that they are set to emulate various 
versions of topological (i.e. essentially local) dimension. In 
particular, all of them fail both our Axioms 1 and 2. 

• Correlation dimension, which is a computationally ef- 
ficient version of the box-counting dimension, see p. 21|. 

• Packing dimension, or rather its computable version as 
proposed and explored in 



1^ 



• Distance exponent [2,21], which is a version of the
well-known Minkowski dimension.

• An algorithm for estimating the intrinsic dimension 
based on the Takens theorem from differential geometry 

• A non-local approach to intrinsic dimension estimation
based on entropy-theoretic results has also been proposed;
however, in the case of manifolds the algorithm will still return
the topological dimension, so the same conclusions apply.

8. Discussion 

We have proposed a new concept of the intrinsic di- 
mension of a dataset or, more generally, of a metric space 
equipped with a probability measure. Dimension functions 
of the new type behave in a very different way from the 
more traditional approaches, and are closer in spirit to, 
though still different from, the notion put forward in [5]
(cf. a comparative discussion in [18]). In particular, high
intrinsic dimension indicates the presence of the curse of 
dimensionality, while lower dimension expresses the exis- 
tence of a small set of well-dissipating features and a pos- 
sibility of dimension reduction of X to a low-dimensional 
feature space. The intrinsic dimension of a random sample 
of a manifold is close to that of the manifold itself, and for
standard geometric objects such as spheres or cubes the
values returned by our dimension are "correct".

Two main problems pinpointed in this article are pro- 
hibitively high computational complexity of the new 
concepts, as well as their instability under random high- 
dimensional noise. 

The root cause of both problems is essentially the same: 
the class of all 1-Lipschitz functions is just too broad to 
serve as the set of admissible features. The richness of the 
spaces $\mathrm{Lip}_1(X)$ explains why computing concentration
invariants of an mm-space is hard: roughly speaking, there
are just too many feature functions on the space that are to 
be examined one by one. The abundance of Lipschitz func- 
tions on a discrete metric space X is exactly what makes 
the Gromov distance from a random gaussian sample to a 
manifold large. 

At the same time, there is clearly no point in taking into 
account, as a potential feature, say, a typical polynomial 
function of degree 10 on the ambient space R^*"^, because 
such a function may contain up to 1.7 x 10^^ monomials.
Since we cannot store, let alone compute, such a function,
why should we care about it at all?

A way out, as we see it, consists in refining the approach 
and modelling a dataset as a pair $(X, \mathcal{F})$, consisting of an
mm-space X together with a class of admissible features,
$\mathcal{F} \subseteq \mathrm{Lip}_1(X)$, whose statistical learning capacity
measures (VC-dimension, covering numbers, Rademacher
averages, etc.) are limited. This will accurately reflect the fact
that in practice one only uses features that are computa- 
tionally cheap, and will allow a systematic use of Vapnik- 
Chervonenkis theory. 
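
As a rough illustration of what working with a restricted feature class could look like, the Python sketch below takes a small explicit family of 1-Lipschitz features, namely distance functions to a few landmark points, and reports the largest spread of any feature over the sample. Both the choice of landmark features and the use of the standard deviation as the measure of spread are assumptions made for this sketch. They are not constructions taken from this article.

```python
import math
import random
import statistics

def landmark_features(landmarks):
    """Distance-to-landmark functions; each x -> d(x, p) is 1-Lipschitz."""
    return [lambda x, p=p: math.dist(x, p) for p in landmarks]

def largest_feature_spread(points, features):
    """Largest standard deviation of a feature over the sample."""
    return max(statistics.pstdev([f(x) for x in points]) for f in features)

rng = random.Random(0)
points = [[rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)] for _ in range(200)]
features = landmark_features(points[:5])  # hypothetical choice of 5 landmarks
spread = largest_feature_spread(points, features)
print(spread)
```

Scanning an explicit class of this kind costs one evaluation per feature and per point, whereas an exhaustive search over all 1-Lipschitz functions on the sample has no comparably cheap description.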

All the main concepts of asymptotic geometric analy- 
sis will have to be rewritten in the new framework, and 
this seems to be a potentially rewarding subject for further 
investigation. A theoretical challenge would be to obtain
noise stability results under the general statistical assump- 
tions of 

Finally, the Gromov distance between two mm-spaces, 
X and Y , is determined on the basis of comparing the fea- 
tures of X and Y rather than the spaces themselves, which 
opens a possibility to try and construct an approximating 
principal manifold to X by methods of unsupervised
machine learning by optimizing over suitable sets of Lipschitz
functions, as in [23].

The concept of dimension in mathematics admits a very 
rich spectrum of interpretations. We feel that the topolog- 
ical versions of dimension have been dominating applica- 
tions in computing to the detriment of other approaches. 
We feel that the concept of dimension based on the view- 
point of asymptotic geometric analysis could be highly rel- 
evant to analysis of large sets of data, and we consider this 
article as a small step in the direction of developing this 
approach. 